short communications 



Acta Crystallographica Section D 

Biological 
Crystallography 



ISSN 0907-4449 



Nearest-cell: a fast and easy tool for locating crystal 
matches in the PDB 



V. Ramraj, a,b * G. Evans, b J. M. Diprose a and R. M. Esnouf a



a The Division of Structural Biology, Wellcome 
Trust Centre for Human Genetics, University of 
Oxford, Oxford OX3 7BN, England, and 
b Diamond Light Source, Harwell Science and 
Innovation Campus, Didcot OX11 0DE, England



Correspondence e-mail: varun@strubi.ox.ac.uk 



When embarking upon X-ray diffraction data collection from a potentially novel 
macromolecular crystal form, it can be useful to ascertain whether the measured 
data reflect a crystal form that is already recorded in the Protein Data Bank and, 
if so, whether it is part of a large family of related structures. Providing such 
information to crystallographers conveniently and quickly, as soon as the first 
images have been recorded and the unit cell characterized at an X-ray beamline, 
has the potential to save time and effort as well as pointing to possible search 
models for molecular replacement. Given an input unit cell, and optionally 
a space group, Nearest-cell rapidly scans the Protein Data Bank and retrieves 
near-matches. 

1. Introduction

X-ray crystallography remains the primary method for the determi- 
nation of the atomic structure of biological macromolecules. At the 
time of writing, more than 80 000 structures form the Protein Data 
Bank (PDB; http://www.wwpdb.org; Westbrook et al., 2005), of which
roughly 87% have been solved using X-ray crystallography. 

The rate at which macromolecular crystallography (MX) data sets 
can now be measured at synchrotron-radiation facilities (Winter & 
McAuley, 2011) raises issues relating to the effective use of beamtime. 
Automated tools that allow synchrotron beamline users to be as 
efficient as possible are under continual development (Bahar et al.,
2006; Keegan & Winn, 2007; Panjikar et al., 2009; Winter & McAuley,
2011). The tool described here, Nearest-cell, is a useful addition to this 
automation armoury. 

Somewhat masked by the success of MX, numerous challenges 
remain in protein production, purification and crystallization. This 
is particularly the case for complexes comprising multiple protein 
subunits as well as membrane proteins, where there can be an 
elevated risk of purifying host-system expression byproducts along 
with the target of interest. Given the difficulties associated with 
crystallizing many of these 'high-impact' targets, it is often the case 
that the 'impurity' protein crystallizes more readily. A ready way of 
determining whether a crystal might arise from an impurity, such as 
Nearest-cell, is particularly useful in these situations. 

Nearest-cell has been installed at the MX beamlines at the 
Diamond Light Source and uses output from automated data-analysis 
pipelines such as fast_dp (Winter & McAuley, 2011) to provide users 
with a putative list of similar unit cells (and hence, potentially, 
structures) in the PDB. 



2. Experimental procedures 

Nearest-cell depends on a custom set of software (a pipeline) 
designed to update an internal database. It was written to be executed 
weekly, coinciding with updates of the PDB. 

2.1. Database pipeline 

The pipeline is written in C++; it updates a database of key 
information (PDB ID, organism, experimental method, unit cell, 
space group, R factors) from PDB XML files and consists of software 
and a database that is used to store
the necessary information for rapid 
retrieval by Nearest-cell. It uses part 
of the PHENIX software suite 
(Adams et al., 2010), specifically
phenix.explore_metric_symmetry, to
pre-compute the reduced symmetry
P1 cell for a given unit cell and space
group. The pipeline parses the unit
cell corresponding to space group
P1 from the phenix.explore_metric_symmetry
output and stores it in the
database.

The pipeline runs automatically to 
coincide with the PDB update 
schedule and performs the following 
tasks. 

(i) Synchronize a local PDB XML 
repository with the PDBe mirror. 

(ii) Extract key information from 
new or changed PDB entries and add 
it to the database. Purge superseded 
entries. 

(iii) Run phenix.explore_metric_symmetry
on each updated PDB entry; store
the P1 cell in the database.

(iv) Generate flat file SEQRES and 
ATOM records for each updated 
PDB XML file. 
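
As an illustration of tasks (ii) and (iii), the sketch below stores per-entry records in SQLite and purges superseded entries. The database engine, the table name `entries`, the column names and the helper functions are all assumptions made for this example; the text does not specify them, and the pipeline itself is written in C++ rather than Python.

```python
import sqlite3

# Hypothetical column layout: deposited cell plus pre-computed P1 cell.
CELL = ["a", "b", "c", "alpha", "beta", "gamma"]
COLUMNS = (["pdb_id", "organism", "method", "space_group", "r_work", "r_free"]
           + CELL + ["p1_" + name for name in CELL])
TEXT_COLUMNS = {"pdb_id": "TEXT PRIMARY KEY", "organism": "TEXT",
                "method": "TEXT", "space_group": "TEXT"}


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    cols = ", ".join(f"{c} {TEXT_COLUMNS.get(c, 'REAL')}" for c in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS entries ({cols})")
    return conn


def upsert_entry(conn, entry):
    """Insert a new entry or overwrite a changed one; missing fields become NULL."""
    placeholders = ", ".join(":" + c for c in COLUMNS)
    conn.execute(
        f"INSERT OR REPLACE INTO entries ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        {c: entry.get(c) for c in COLUMNS},
    )


def purge_superseded(conn, current_ids):
    """Delete rows whose PDB ID is no longer in the synchronized repository."""
    stored = {row[0] for row in conn.execute("SELECT pdb_id FROM entries")}
    stale = sorted(stored - set(current_ids))
    conn.executemany("DELETE FROM entries WHERE pdb_id = ?", [(i,) for i in stale])
    return stale
```

Keeping the P1 cell in the same row as the deposited cell means that a search only has to read one table; this is a design choice of the sketch, not a documented property of the real database.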

2.1.1. Auxiliary pipeline features. 
The pipeline also stores the number 
of space-group symmetry operators 
for each space group. While PHENIX 
can be invoked each time for this 
information as required, it is faster for 
Nearest-cell to retrieve this informa- 
tion from a database. These data are 
used by the family-clustering algo- 
rithm (described below). The pipeline 
also allows the manual curation of 
alternate space groups and indexing 
conventions that occasionally arise in 
the PDB. 

The SEQRES records that are generated by the pipeline are simple
FASTA format files containing descriptive headers and single-letter
amino-acid sequences for each chain of a PDB entry. The single-letter
sequence is derived from the three-letter amino-acid code in the PDB
XML file. To account for nonstandard amino acids, the pipeline is
able to call a JSON web service developed by the EBI for this specific
purpose (personal communication with Jose Dana and Sameer
Velankar of the EBI) to retrieve the appropriate standard amino acid
for a given nonstandard input. For example, the amino acid
selenomethionine, coded in a PDB record as 'MSE', is resolved by the
JSON web service to 'M' (methionine). Once again, it is advantageous
to retrieve all nonstandard amino-acid mappings in advance,
since the JSON query is slower and relies on an external server. The
pipeline has the capacity to pre-fetch and store all of these mappings
to the database, although this feature need not be run weekly.

Figure 1

Schematic showing Nearest-cell's logic. (1) The input cell is first
converted to P1 if required. (2) It is then compared with every
known P1 cell in the PDB using MATFIT (McLachlan, 1972; Kabsch,
1976, 1978); the schematic in box 2a shows an example superposition
with one permutation of the database P1 cell (O' superposed on O,
A' on A, B' on B and C' on C). If the lowest r.m.s. difference of all
six superpositions is less than the specified cutoff (see §2.2.1), the
database cell qualifies as a positive match. (3) The family-clustering
algorithm clusters PDB entries into families of sequence similarity.
Results are then displayed to the user with each family represented
by the PDB entry with the smallest r.m.s. difference from the input.
Families can be expanded to show all hits, as shown in Fig. 2.
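
The conversion from three-letter residue codes to a FASTA record described in §2.1.1 might look like the following minimal sketch. Only the MSE to methionine mapping comes from the text; the header format, the fallback letter 'X' for unresolved residues and the function names are assumptions, and the dictionary `NONSTANDARD_PARENT` stands in for the pre-fetched mappings rather than reproducing the EBI service.

```python
# Standard three-letter to one-letter amino-acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Stand-in for the cached nonstandard-residue mappings (assumption:
# each nonstandard code maps to one standard three-letter parent).
NONSTANDARD_PARENT = {"MSE": "MET"}


def to_one_letter(residues, unknown="X"):
    """Translate a list of three-letter codes into a one-letter sequence."""
    letters = []
    for code in residues:
        code = code.upper()
        code = NONSTANDARD_PARENT.get(code, code)  # resolve e.g. MSE -> MET
        letters.append(THREE_TO_ONE.get(code, unknown))
    return "".join(letters)


def fasta_record(pdb_id, chain_id, residues, width=60):
    """Build one FASTA record; the '>ID_chain' header is a hypothetical format."""
    sequence = to_one_letter(residues)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{pdb_id}_{chain_id}"] + lines)
```

Resolving nonstandard residues from a local table first, and falling back to a remote lookup only for codes that are missing, is the pattern the paragraph above argues for.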

2.2. Nearest-cell 

Nearest-cell is a multi-process-capable, command-line-driven C++
application with a Python web-service front end. Fig. 1 describes
the logic underpinning Nearest-cell. When invoked, it calls several
external applications.

(i) phenix.explore_metric_symmetry for reducing the query unit
cell to a P1 unit cell.



(ii) MATFIT, a superposition subroutine (McLachlan, 1972; 
Kabsch, 1976, 1978) described in §2.2.1. 

(iii) CD-HIT (Li & Godzik, 2006), a sequence-clustering method, 
as part of the family-clustering algorithm (§2.2.2). 

2.2.1. MATFIT. This is a Fortran subroutine that calculates the
rotation matrix and translation vector for the best superposition of
two sets of atomic position vectors (McLachlan, 1972; Kabsch, 1976,
1978) and returns an r.m.s. difference. When Nearest-cell compares
the input P1 cell against a pre-computed PDB P1 cell in its database,
it tests all six valid right-handed combinations of axes (since proteins
are enantiomorphic), running MATFIT each time and choosing the
smallest r.m.s. difference of the six. If this lowest r.m.s. difference is
within a cutoff (either specified on the command line or, by default,
set to the larger of 2.5 Å or 1% of the sum of the longest and the
shortest unit-cell dimensions), the PDB cell qualifies as a positive
match. Box 2a in Fig. 1 shows one such comparison between the
query P1 cell and a database P1 cell.
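
The comparison in §2.2.1 can be sketched in pure Python as follows. This is not MATFIT: the r.m.s. difference after optimal rotation and translation is obtained here from the largest eigenvalue of Horn's quaternion matrix, which gives the same minimum as a Kabsch-style fit. Two further assumptions are made: the six right-handed combinations are generated by permuting the axes and inverting all three for odd permutations, and the default cutoff uses the query cell lengths. All function names are hypothetical.

```python
import math
from itertools import permutations


def cell_to_points(a, b, c, alpha, beta, gamma):
    """Origin O plus the tips A, B, C of the cell axes, in Cartesian coordinates."""
    al, be, ga = (math.radians(v) for v in (alpha, beta, gamma))
    cx = c * math.cos(be)
    cy = c * (math.cos(al) - math.cos(be) * math.cos(ga)) / math.sin(ga)
    cz = math.sqrt(max(c * c - cx * cx - cy * cy, 0.0))
    return [(0.0, 0.0, 0.0), (a, 0.0, 0.0),
            (b * math.cos(ga), b * math.sin(ga), 0.0), (cx, cy, cz)]


def max_eigenvalue(matrix, sweeps=30):
    """Largest eigenvalue of a small symmetric matrix (cyclic Jacobi rotations)."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(sweeps):
        for p in range(n):
            for q in range(p + 1, n):
                if m[p][q] == 0.0:
                    continue
                diff = m[q][q] - m[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * m[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    kp, kq = m[k][p], m[k][q]
                    m[k][p], m[k][q] = c * kp - s * kq, s * kp + c * kq
                for k in range(n):  # update rows p and q
                    pk, qk = m[p][k], m[q][k]
                    m[p][k], m[q][k] = c * pk - s * qk, s * pk + c * qk
    return max(m[i][i] for i in range(n))


def superposed_rmsd(p, q):
    """R.m.s. difference of two point sets after optimal rotation and translation."""
    n = len(p)
    cp = [sum(v[i] for v in p) / n for i in range(3)]
    cq = [sum(v[i] for v in q) / n for i in range(3)]
    x = [[v[i] - cp[i] for i in range(3)] for v in p]
    y = [[v[i] - cq[i] for i in range(3)] for v in q]
    s = [[sum(xk[i] * yk[j] for xk, yk in zip(x, y)) for j in range(3)] for i in range(3)]
    key = [  # Horn's symmetric 4x4 matrix built from the covariance matrix s
        [s[0][0] + s[1][1] + s[2][2], s[1][2] - s[2][1], s[2][0] - s[0][2], s[0][1] - s[1][0]],
        [s[1][2] - s[2][1], s[0][0] - s[1][1] - s[2][2], s[0][1] + s[1][0], s[2][0] + s[0][2]],
        [s[2][0] - s[0][2], s[0][1] + s[1][0], s[1][1] - s[0][0] - s[2][2], s[1][2] + s[2][1]],
        [s[0][1] - s[1][0], s[2][0] + s[0][2], s[1][2] + s[2][1], s[2][2] - s[0][0] - s[1][1]],
    ]
    total = sum(c * c for v in x for c in v) + sum(c * c for v in y for c in v)
    return math.sqrt(max(total - 2.0 * max_eigenvalue(key), 0.0) / n)


def right_handed_variants(points):
    """Six axis relabelings; odd permutations are inverted to stay right-handed."""
    origin, axes = points[0], points[1:]
    variants = []
    for perm in permutations(range(3)):
        inversions = sum(perm[i] > perm[j] for i in range(3) for j in range(i + 1, 3))
        sign = -1.0 if inversions % 2 else 1.0
        variants.append([origin] + [tuple(sign * c for c in axes[k]) for k in perm])
    return variants


def best_rmsd(query_cell, db_cell):
    """Smallest r.m.s. difference over the six variants of the database cell."""
    query = cell_to_points(*query_cell)
    return min(superposed_rmsd(query, v)
               for v in right_handed_variants(cell_to_points(*db_cell)))


def default_cutoff(cell):
    """Larger of 2.5 A or 1% of (longest + shortest) cell length."""
    lengths = cell[:3]
    return max(2.5, 0.01 * (max(lengths) + min(lengths)))
```

A database cell would then count as a match when `best_rmsd(query, db) <= default_cutoff(query)`. For a cell with a = b = 57.86 Å and c = 150.12 Å the 1% term is about 2.08 Å, so the 2.5 Å floor applies.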

2.2.2. CD-HIT. This program (Li & Godzik, 2006) groups amino- 
acid sequences into clusters at a desired level of sequence identity 
(set to 90% of the length of the shortest sequence by default). Each 
cluster is described by a representative sequence. It is used here as 
a preliminary clustering step for the family-clustering algorithm 
described below. 

[Figure 2 image: screenshot of a fast_dp report (merging point group P 4 2 2; unit cell 57.86, 57.86, 150.12 Å, 90.00, 90.00, 90.00°) with the Nearest-cell results appended. Family 1 is expanded to list thaumatin entries in space group P41212; Family 2 has two members, including 3QFX (P41), a Trypanosoma brucei dihydrofolate reductase pyrimethamine complex.]

Figure 2

Typical output from Nearest-cell, shown as part of Diamond's fast_dp report for a thaumatin unit cell. The results are appended to the end of a fast_dp run. Family 1 contained
46 thaumatin unit cells clustered together, showing the effectiveness of the family-clustering algorithm for reducing the number of results displayed to the user (inset). Note
that this family contains two exact matches (r.m.s. difference = 0.00 Å).

2.3. Family-clustering algorithm

This algorithm was developed to exploit similarity at the sequence 
level to usefully group matches at the PDB-record level, which can 
contain different numbers of chains and/or belong to different space 
groups. This allows Nearest-cell to substantially reduce the output. 
The problem is evident for an input cell matching that of horse heart 
myoglobin (PDB entry 3vau; Yi & Richter-Addo, 2012), for instance, 
which produces 126 hits when run through Nearest-cell. Since most of 
the hits are from the same family (myoglobin), the family-clustering 
algorithm reduces the output to representative PDB IDs for each of 
just five families. The output from a more typical query is shown in 
Fig. 2. 

The basic logic of the algorithm is shown in step 3 of Fig. 1. All
sequences from all PDB entries with matching P1 cells are grouped
into clusters using CD-HIT. The contents of the asymmetric unit for
each PDB entry can then be described by how many examples of each
cluster it contains. This can then be expanded to describe the P1 cell
by multiplying by the number of symmetry operators for the space
group. Finally, two PDB entries are placed in the same family only if
their CD-HIT cluster numbers and multiplicities match.
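
The grouping rule in the paragraph above can be sketched as follows, assuming that CD-HIT has already assigned a cluster number to every chain. The record layout (`pdb_id`, `rmsd`, `chain_clusters`, `n_symops`) and the function names are hypothetical; only the rule itself (cluster counts multiplied by the number of symmetry operators must match) is taken from the text.

```python
from collections import Counter, defaultdict


def family_signature(chain_clusters, n_symops):
    """Describe the P1 cell contents as {(cluster id, copies per cell)}."""
    per_asymmetric_unit = Counter(chain_clusters)
    return frozenset((cluster, count * n_symops)
                     for cluster, count in per_asymmetric_unit.items())


def group_into_families(hits):
    """Group matching entries into families, best (lowest r.m.s.) member first.

    Each hit is a dict with the hypothetical keys 'pdb_id', 'rmsd',
    'chain_clusters' (one CD-HIT cluster number per chain in the
    asymmetric unit) and 'n_symops' (symmetry operators of the space group).
    """
    families = defaultdict(list)
    for hit in hits:
        key = family_signature(hit["chain_clusters"], hit["n_symops"])
        families[key].append(hit)
    ordered = [sorted(members, key=lambda h: h["rmsd"]) for members in families.values()]
    ordered.sort(key=lambda members: members[0]["rmsd"])
    return ordered
```

With this signature, an entry with one chain in P41212 (eight operators) and an entry with two chains of the same cluster in P41 (four operators) both describe eight copies per P1 cell and therefore fall into the same family, which is how entries with different chain counts or space groups can be grouped together.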

3. Results and discussion 

Nearest-cell is currently available for public use through the
web service located at
http://www.strubi.ox.ac.uk/nearest-cell/nearest-cell.cgi.

The web service takes a unit cell as required input. The space group
is optional and, if not provided, is assumed to be P1. In parallel
computation mode, using two cores on a modern computer, computation
takes just under 1 s. Across 24 cores, this computation time is
reduced to 0.3 s. The entire web-service request from start to finish
takes about 5 s if the space group is P1 and about 10 s otherwise (the
difference being the overhead of invoking PHENIX to reduce the unit
cell to P1 using phenix.explore_metric_symmetry). Note that space
groups need to be in PHENIX-accepted format (Adams et al., 2010).
The CGI script can also be invoked using a GET request with the
parameters in the URL; for example,
http://www.strubi.ox.ac.uk/nearest-cell/nearest-cell.cgi?unit-cell=24,24,24,90,90,90&space-group=R3:R.
In this way, URLs can be generated programmatically as part of other
pipelines. This is especially useful for facilities such as the Diamond
Light Source, where Nearest-cell has been integrated into internal
pipelines such as fast_dp (Fig. 2; Winter & McAuley, 2011).
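
A pipeline could assemble such a request URL along the following lines. The base URL and the `unit-cell` and `space-group` parameter names are taken from the example above; the function name and the number formatting are assumptions, and the sketch only builds the URL without contacting the server.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.strubi.ox.ac.uk/nearest-cell/nearest-cell.cgi"


def nearest_cell_url(cell, space_group=None):
    """Build a GET URL for a unit cell (a, b, c, alpha, beta, gamma)."""
    if len(cell) != 6:
        raise ValueError("unit cell needs six values: a, b, c, alpha, beta, gamma")
    # 'g' formatting drops trailing zeros, so 24 and 24.0 both become '24'.
    params = {"unit-cell": ",".join(format(value, "g") for value in cell)}
    if space_group:
        params["space-group"] = space_group
    # Keep commas and colons literal so the URL matches the documented form.
    return BASE_URL + "?" + urlencode(params, safe=",:")
```

Omitting `space_group` leaves the parameter out of the URL, which the service treats as P1 according to the text.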



4. Conclusion 

The design decision for Nearest-cell was to base matching solely on the
unit-cell dimensions rather than attempting to match (low-resolution)
structure factors. Although our approach is less selective and gives
more false positives, it allows Nearest-cell to be run more rapidly and
directly after unit-cell characterization, thereby informing effective
use of beamtime. While current PDB search tools do allow a search of
unit-cell dimensions within given tolerances, this does not provide the
comprehensive matching provided by Nearest-cell. The more rigorous
approach of matching structure factors is embodied within
molecular-replacement strategies such as BALBES (Long et al., 2008).